This past November, New York City Mayor Eric Adams signed the 'Rat Action Plan' in an attempt to cut down on the city's rodent population. So far this year, the city's Sanitation Department has reported more than 21,600 rat complaints, an increase of around 71% since October 2020. Many officials blame the COVID-19 pandemic for this shocking increase, arguing that work-from-home and outdoor dining policies have put more trash bags, of which the rats make a nice meal, on the sidewalks and streets. The 'Rat Action Plan' outlines several new rat-minimizing rules.
In addition, the New York City Sanitation Department has stated that effective April 1, 2023, citizens will be fined for putting their trash out before 8 p.m. for the overnight trash collectors. By minimizing the amount of time the rats' main food source sits on the curb, the city hopes to reduce the population. Source: https://www1.nyc.gov/assets/dsny/site/resources/press-releases/department-of-sanitation-publishes-final-rules-to-drastically-reduce-hours-trash-will-sit-on-nyc-streets
We were very interested in exploring the data collected by the Department of Sanitation, particularly the location data. We were curious to see whether certain areas of the city have more of a problem than others, and if so, why. If some areas have far fewer rats, what is different about those areas, or what are policymakers doing differently there? To begin to evaluate these questions, let's take a look at our data.
For this project, we are using Python 3 and several imported libraries shown below.
#Libraries and imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import IFrame
import folium
import geopandas as gpd
from shapely.geometry import Point, Polygon
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")
Our first step is to search for relevant data. We found a CSV file of rat sightings from 2010 to 2022 on https://data.cityofnewyork.us/. This data is updated daily and gathered from Sanitation Department service requests. We will use it to get location data on each rat sighting.
rat_path = "Rat_Sightings.csv"
df = pd.read_csv(rat_path, low_memory=False)
df = df[["Created Date", "Location Type", "Incident Zip", "Incident Address", "Borough", "Latitude", "Longitude"]]
df.sort_values(by=['Created Date'], inplace=True)
Now we can add a year and month column from 'Created Date', so we can extract data pertaining to specific time periods.
format = '%m/%d/%Y %I:%M:%S %p'  # %I (12-hour clock) is the code that pairs with %p (AM/PM)
# Extracting the year and month from 'Created Date' & making their own columns
df['Year'] = df['Created Date'].apply(lambda x: datetime.strptime(x, format).year)
df['Month'] = df['Created Date'].apply(lambda x: datetime.strptime(x, format).month)
df.head()
| | Created Date | Location Type | Incident Zip | Incident Address | Borough | Latitude | Longitude | Year | Month |
|---|---|---|---|---|---|---|---|---|---|
| 159386 | 01/01/2010 02:15:27 PM | Vacant Building | 11218.0 | 900 CONEY ISLAND AVENUE | BROOKLYN | 40.635654 | -73.967757 | 2010 | 1 |
| 158069 | 01/01/2010 03:05:37 PM | 3+ Family Apt. Building | 11377.0 | 31-14 58 STREET | QUEENS | 40.756987 | -73.903618 | 2010 | 1 |
| 158077 | 01/01/2010 04:14:27 PM | 3+ Family Apt. Building | 10467.0 | 2504 BRONX PARK EAST | BRONX | 40.863614 | -73.870441 | 2010 | 1 |
| 159170 | 01/01/2010 08:29:58 AM | 3+ Family Apt. Building | 11206.0 | 202 PULASKI STREET | BROOKLYN | 40.692989 | -73.943771 | 2010 | 1 |
| 54396 | 01/01/2010 08:52:19 PM | Other (Explain Below) | 11201.0 | NaN | BROOKLYN | 40.688903 | -73.980929 | 2010 | 1 |
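Parsing with `datetime.strptime` row by row via `.apply` works, but `pd.to_datetime` can parse the whole column in one vectorized call. A minimal sketch on hypothetical date strings in the same format as 'Created Date' (note that `%I`, the 12-hour code, is what pairs with the `%p` AM/PM marker):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the 'Created Date' strings in the CSV
sample = pd.DataFrame({"Created Date": ["01/01/2010 02:15:27 PM",
                                        "07/15/2021 09:05:00 AM"]})

# Parse the entire column at once; %I (12-hour) pairs with %p (AM/PM)
parsed = pd.to_datetime(sample["Created Date"], format="%m/%d/%Y %I:%M:%S %p")
sample["Year"] = parsed.dt.year
sample["Month"] = parsed.dt.month
```

With `%I`/`%p` the PM times also carry the correct 24-hour clock value, which matters if we ever analyze time of day.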
This looks great! Our dataset now has nine columns.
We can use the Year and Month data to look at how rat sightings change over time. The Location Type column can help us determine what types of places have the most frequent sightings. Zip Code, Borough, and Lat/Long will allow us to visualize and map exactly where the rats are.
To visualize the data, it will be beneficial to plot these coordinates. Any NaN values in the Latitude or Longitude columns will cause an error when we try to create a map. Therefore, we must remove rows that contain NaN values in these columns. Let's see how many rows are missing coordinate values:
missing_lats = df["Latitude"].isna().sum()
missing_longs = df["Longitude"].isna().sum()
print("Number of Latitude Values Missing:", missing_lats)
print("Number of Longitude Values Missing:", missing_longs)
Number of Latitude Values Missing: 2160
Number of Longitude Values Missing: 2160
We can easily drop the 2,160 rows that are missing coordinates; however, we may still need those rows when we conduct our borough analysis. To preserve this data, we can work in a separate copy of the dataframe and drop the rows only there.
# coord_df is a copy of the dataframe without the rows that lack coordinate data
# one dropna call removes rows missing a value in either coordinate column
coord_df = df.dropna(subset=["Latitude", "Longitude"]).copy()
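The `subset` argument accepts several columns at once, and a row is dropped when any of them is NaN. A toy frame (values invented for illustration) shows the behavior:

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, rows 1 and 2 each miss one coordinate
toy = pd.DataFrame({"Latitude": [40.6, np.nan, 40.7],
                    "Longitude": [-73.9, -73.9, np.nan]})

# One dropna call removes any row missing either coordinate
clean = toy.dropna(subset=["Latitude", "Longitude"])
```

Chaining two separate `dropna` calls on the same source frame, by contrast, would only keep the effect of the last one.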
The following code generates a map of NYC's rat sightings using coordinates from 2010-2022. There are two ways we can make a map: with folium or with geopandas. Let's see which is best for visualizing rat sighting data in NYC.
# geopandas lets us plot the coordinates in our dataframe over a street map
street_map = gpd.read_file("nyu_2451_34490")
# use coord_df so no NaN coordinates slip into the geometry
geometry = [Point(xy) for xy in zip(coord_df['Longitude'], coord_df['Latitude'])]
geo_df = gpd.GeoDataFrame(coord_df,
                          crs='EPSG:4326',
                          geometry=geometry)
fig, ax = plt.subplots(figsize = (10,10))
street_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
geo_df.plot(ax=ax, cmap = 'RdPu', alpha = .5)
ax.set_title('NYC Rat Incident Heatmap')
Text(0.5, 1.0, 'NYC Rat Incident Heatmap')
This approach helps us visualize density throughout NYC. However, it may pack too much information onto one heatmap. With this many coordinate points it is difficult to deduce meaning: we cannot see how time has influenced the location of rat sightings, and there are so many overlapping points that the majority of the map is the same dark purple color.
What is the best way of visualizing rat sightings for every year without becoming overwhelmed with coordinate points?
First let's make the heatmap with multiple colors to indicate density.
# function to generate a Folium base map, with default values for zoom and tiles
def generateBaseMap(loc, zoom=10, tiles='Stamen Toner'):
    '''
    Generates a Folium base map
    loc: [lat, long] center of the map
    zoom: zoom level, default 10
    tiles: tile set, default Stamen Toner
    '''
    return folium.Map(location=loc,
                      control_scale=True,
                      zoom_start=zoom,
                      tiles=tiles)
nyc = [40.7400, -73.985880] # generic nyc lat/lon in list format
base_map = generateBaseMap(nyc) # pass lat/long to function
base_map
# create a count column and populate with 1 for grouping and summing
coord_df.loc[:, 'Count'] = 1
# group by latitude and longitude, summing the count for each coordinate,
# to build the [lat, long, weight] list that folium's HeatMap expects
lst = (coord_df.groupby(['Latitude', 'Longitude'])['Count']
               .sum().reset_index().values.tolist())
# import HeatMap plugin
from folium.plugins import HeatMap
# add data to basemap which we created above with custom function
HeatMap(data=lst, radius=10).add_to(base_map);
# save base map as .html
base_map.save('./rat_sighting_HeatMap.html')
# call map
base_map
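The groupby step that feeds folium's `HeatMap` can be illustrated on a tiny invented frame: repeat sightings at the same coordinate collapse into one `[lat, lon, weight]` triple.

```python
import pandas as pd

# Two sightings at one point, one at another (coordinates invented)
toy = pd.DataFrame({"Latitude": [40.74, 40.74, 40.63],
                    "Longitude": [-73.98, -73.98, -73.97],
                    "Count": [1, 1, 1]})

# Sum the count per coordinate to get the [lat, lon, weight] triples HeatMap expects
triples = (toy.groupby(["Latitude", "Longitude"])["Count"]
              .sum().reset_index().values.tolist())
```

The weight is what lets the heatmap render a coordinate with many sightings hotter than one with a single sighting.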
This graph is helpful because you can zoom in to see the rat sighting hotspots throughout the city from 2010-2022. These hotspots cluster around popular tourist areas such as the NYC Theater District.
However, without the passage of time being shown on the map, we lose a significant dimension of the data. Let's make a time-lapse heatmap to demonstrate the progression of rat sightings over time.
# create a blank list to hold one entry per year
df_year_list = []
# loop through each year in the data
for year in coord_df['Year'].sort_values().unique():
    # sum totals per coordinate for that year, reset the index, and append as a list
    df_year_list.append(coord_df.loc[coord_df['Year'] == year, ['Latitude', 'Longitude', 'Count']].groupby(['Latitude', 'Longitude']).sum().reset_index().values.tolist())
from folium.plugins import HeatMapWithTime
# create new base map for heat map with time
base_map_2 = generateBaseMap(nyc)
# create a more meaningful index for heat map with time
start = datetime(2010, 1, 1)
end = datetime(2022, 12, 31)
# use pandas daterange function to generate date range object
daterange = pd.date_range(start=start, end=end, periods=13)
# format the index labels as years
time_index = [d.strftime("%Y") for d in daterange]
# instantiate HeatMapWithTime
HeatMapWithTime(df_year_list, radius=11, index=time_index,
gradient={0.1: 'blue', 0.5: 'lime', 0.7: 'orange', 1: 'red'},
min_opacity=0.4, max_opacity=0.8, use_local_extrema=True).add_to(base_map_2)
# save as html
base_map_2.save('./heatmapwithtime_ratsightings.html')
# call result
base_map_2
The time-lapse makes it easier to see how the hotspots shift and intensify from year to year, rather than collapsing thirteen years of sightings into a single static image.
Another approach is to compare the frequency of rat sightings over time. We can plot a histogram of sightings per year to visualize this.
# The frequency list keeps track of the number of sightings per year.
frequency = []
# Each group produced by groupby('Year') is a tuple of:
# index 0 ==> year value (e.g. 2011)
# index 1 ==> dataframe of that year's rat sighting listings
for year in df.groupby('Year'):
    frequency.append(len(year[1]))
# groupby iterates years in ascending order, so sort unique() to match
years = sorted(df.Year.unique())
# This dataframe contains 13 rows & two columns representing the # of rat sightings per year (frequency)
year_freq_df = pd.DataFrame({'Year': years, 'frequency': frequency})
# Plotting the graph; one bin per year, 2010-2022
ax = df.hist(column='Year', bins=13, grid=False, color='#86bf91')
ax = ax[0]
for x in ax:
    x.set_title("Frequency per Year", size=12)
    # Set x-axis label
    x.set_xlabel("Year", labelpad=20, size=12)
    # Set y-axis label
    x.set_ylabel("Frequency", labelpad=20, size=12)
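As an aside, the per-year counting above can also be done without a manual loop: `value_counts().sort_index()` yields the same frequencies in chronological order. A sketch on invented years:

```python
import pandas as pd

# Invented years standing in for the 'Year' column
toy = pd.DataFrame({"Year": [2010, 2010, 2011, 2012, 2012, 2012]})

# Count sightings per year in one call, sorted chronologically
per_year = toy["Year"].value_counts().sort_index()
```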
Now that we have seen the distribution of sightings over years, let's explore how the sightings are distributed throughout the year. Below, we will graph the sighting counts for each month of the year to see if there are certain months or seasons in which sightings are more common.
ax = df['Month'].value_counts().sort_index().plot(kind='bar', title='Time of Year', color='#86bf91')
ax.set_xlabel("Month")
ax.set_ylabel("Number of Sightings")
Text(0, 0.5, 'Number of Sightings')
NYC is one of the largest cities in the world. For administrative purposes, the city was consolidated into five boroughs in 1898: the Bronx, Brooklyn, Manhattan, Queens, and Staten Island.
Which borough has the highest number of rat sightings? We can easily determine this by graphing a bar chart of frequency per borough.
# The b_frequency list keeps track of the number of sightings per borough.
# We can use it to make a bar chart.
b_frequency = []
boroughs = []
for borough in df.groupby('Borough'):
    if borough[0] != 'Unspecified':
        # Adding the frequency for each borough to the lists
        borough_freq = len(borough[1])
        b_frequency.append(borough_freq)
        boroughs.append(borough[0])
borough_freq_df = pd.DataFrame({'boroughs': boroughs, 'frequency': b_frequency})
#print(borough_freq_df)
# Plotting the graph
graph = borough_freq_df.plot.bar('boroughs', 'frequency', color='#86bf91')
# Creating the title, limits, and labels
graph.set_title('NYC Rat Sighting Frequency For Each Borough')
graph.set_xlabel("Borough")
graph.set_ylabel("Frequency")
plt.show()
#Pie chart
borough_density_df = borough_freq_df.copy()
borough_freq_df.set_index(['boroughs'], inplace=True)
colors = ['#2A2D34', '#009DDC', '#F26430', '#6761A8', '#009B72']
graph = borough_freq_df.plot.pie(y='frequency', legend = False, figsize=(7, 7), autopct='%1.1f%%', title="\nNYC Rat Sighting Percentages For Each Borough", startangle=0, colors=colors)
As we can see, Brooklyn has the highest rat sighting frequency; however, it is one of the larger boroughs in NYC, so this is to be expected. When comparing the number of rat sightings for each borough, we must also take the borough's size into consideration. How can we measure frequency while also incorporating a borough's square mileage?
Using each borough's square mileage (listed in the same alphabetical order as the frequency table), we can divide the frequency by the square mileage to create a density value.
sq_mile_info = [42.1, 70.82, 22.83, 108.53, 58.37]  # Bronx, Brooklyn, Manhattan, Queens, Staten Island
# added columns that indicate the square mileage and density for each borough
borough_density_df['sq_mile'] = sq_mile_info
borough_density_df['density'] = borough_density_df['frequency'] / borough_density_df['sq_mile']
print(borough_density_df)
| | boroughs | frequency | sq_mile | density |
|---|---|---|---|---|
| 0 | BRONX | 38742 | 42.10 | 920.237530 |
| 1 | BROOKLYN | 74547 | 70.82 | 1052.626377 |
| 2 | MANHATTAN | 54784 | 22.83 | 2399.649584 |
| 3 | QUEENS | 30580 | 108.53 | 281.765410 |
| 4 | STATEN ISLAND | 8586 | 58.37 | 147.096111 |
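As a quick sanity check of the density arithmetic, Manhattan's figures from the table reproduce its density value:

```python
# Manhattan's figures from the table above
manhattan_sightings = 54784
manhattan_sq_miles = 22.83

# density = sightings per square mile
density = manhattan_sightings / manhattan_sq_miles  # ≈ 2399.65
```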
# Plotting the graph
graph = borough_density_df.plot.bar('boroughs', 'density', color='#86bf91')
# Creating the title, limits, and labels
graph.set_title('NYC Rat Sighting Density for Each Borough')
graph.set_xlabel("Borough")
graph.set_ylabel("Density")
plt.show()
#Pie chart
borough_density_df.set_index(['boroughs'], inplace=True)
colors = ['#2A2D34', '#009DDC', '#F26430', '#6761A8', '#009B72']
graph = borough_density_df.plot.pie(y='density', legend = False, figsize=(7, 7), autopct='%1.1f%%', title="\nNYC Rat Sighting Density Percentages For Each Borough", startangle=0, colors=colors)
Manhattan is the borough with the highest number of rat sightings per square mile (density), with more than double the density of Brooklyn. This could be due to the high level of tourism that takes place in Manhattan: it is the most famous and most visited borough, so there could be a correlation between tourism and rat density.
How do we measure tourism? What other factors could be in play here? Let's consider some related factors.
We can get a better idea of what causes a substantial amount of rat sightings in an area by looking at where the rats are being seen.
Let's take a closer look at the specific types of places where these rats are being spotted. To do this, we have to combine and condense certain location entries to make our data usable.
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['3+ Family Apt. Building','3+ Family Apartment Building','Apartment','3+ Family Apt.', '3+Family Apt.', '3+ Family Apt'], 'Apartment Building'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['1-3 Family Mixed Use Building', '1-2 Family Mixed Use Building', '3+ Family Mixed Use Building'], 'Mixed Use Building'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['1-2 FamilyDwelling', '1-3 Family Dwelling', '1-2 Family Dwelling'], 'Family Dwelling'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Restaurant/Bar/Deli/Bakery', 'Store', 'Commercial Building', 'Catering Service', 'Retail Store', 'Restaurant', 'Grocery Store'], 'Commercial Property'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Abandoned Building', 'Vacant Building', 'Vacant Lot'], 'Vacant Lot/Property'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Street Fair Vendor', 'Ground', 'Street Area'], 'Street'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Summer Camp', 'Cafeteria - Public School', 'School/Pre-School'], 'School'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Residential Property','Residence','Private House','Single Room Occupancy (SRO)','Mixed Use Building', 'Family Dwelling', 'Apartment Building'], 'Residential Building'))
df['Location Type'] = df['Location Type'].replace(dict.fromkeys(['Other (Explain Below)'], 'Other'))
df['Location Type'].value_counts()
| Location Type | Count |
|---|---|
| Residential Building | 143742 |
| Other | 31527 |
| Commercial Property | 11042 |
| Vacant Lot/Property | 9093 |
| Construction Site | 4718 |
| Parking Lot/Garage | 2133 |
| Catch Basin/Sewer | 2069 |
| Public Garden | 902 |
| Government Building | 464 |
| Street | 442 |
| School | 361 |
| Day Care/Nursery | 219 |
| Office Building | 193 |
| Public Stairs | 178 |
| Hospital | 133 |
| Building (Non-Residential) | 23 |
| Beach | 2 |
There were several entries spelled differently, and many that we could combine into larger categories (e.g. Abandoned Building, Vacant Building, Vacant Lot into Vacant Lot/Property). Now that we have a smaller number of unique entries to work with, we can take a look at the most common locations.
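An alternative to the chain of `.replace(dict.fromkeys(...))` calls is a single flat mapping dict, which is easier to maintain; values not in the dict pass through unchanged. A sketch with just a few of the categories (a full version would list every variant above):

```python
import pandas as pd

# Partial mapping dict (illustrative; the real one would cover every variant)
category_map = {
    "Vacant Building": "Vacant Lot/Property",
    "Abandoned Building": "Vacant Lot/Property",
    "Other (Explain Below)": "Other",
}

toy = pd.Series(["Vacant Building", "Street", "Other (Explain Below)"])
condensed = toy.replace(category_map)  # unmapped values are left as-is
```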
#Location of sightings
ax = df['Location Type'].value_counts().plot(kind='bar', title='Location of Rat Sightings', color='#86bf91')
ax.set_xlabel("Locations")
ax.set_ylabel("Number of Sightings")
Text(0, 0.5, 'Number of Sightings')
We can see from the cumulative graph that residential buildings are by far the most common location for rat sightings. The next most common locations are Other (which does not give us much information), Commercial Property, and Vacant Lot/Property. Note: this only tells us where sightings take place, not where the rats live.
Residential buildings consist of private homes, apartments, and mixed-use buildings (possibly hotels), all places where tourists may reside during their stay in NYC. However, the number of long-term residents may also influence the number of rat sightings.
Knowing this information, we believe the rat sightings could be correlated with two features: resident population size & tourism. Let's investigate the cause of rat sightings through those two perspectives.
Now that we have taken a closer look at the rat sightings data, it is time to look at income in NYC. We are particularly interested in finding the median income per zip code and comparing this to the rat sightings in these areas. To find this data, we looked at https://data.cccnewyork.org/. From this site, we were able to find data for the years 2017, 2018, and 2019.
We hypothesize that there will be fewer rat sightings in zipcodes with higher median income.
To perform a regression, we must import and clean up the data.
income_path = "Median_Incomes.csv"
income_df = pd.read_csv(income_path, low_memory=False)
income_df = income_df[income_df.Type == "All Households"]
income_df.drop(columns=['Zip Code XXXXX', 'Currency'], inplace=True)
income_df.sort_values(by=['Year', 'Zip_Code'], inplace=True)
income_df.Year.unique()
income_dict = {2017: {}, 2018: {}, 2019: {}}
# map each zipcode to its median income, per year
for index, entry in income_df.iterrows():
    if entry.Year in income_dict:
        income_dict[entry.Year][entry.Zip_Code] = entry.Amount
#Dropping rows without zipcodes
df = df.dropna(subset=["Incident Zip"])
#Changing zipcodes to integers
df['Incident Zip'] = df['Incident Zip'].astype(int)
#New income column: look up the median income for each sighting's year and zipcode
df.loc[:, 'Income'] = np.nan
for index, rat in df.iterrows():
    if 2017 <= rat['Year'] <= 2019:
        df.at[index, 'Income'] = income_dict[rat['Year']].get(rat['Incident Zip'], np.nan)
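The year-and-zipcode lookup can be sketched on toy data: a comprehension over `(Year, Incident Zip)` pairs does the same `.get` lookup in one pass, rather than mutating the frame row by row (all numbers below are invented):

```python
import numpy as np
import pandas as pd

# Invented income lookup and sightings frame
income_dict = {2017: {10001: 90000.0}, 2018: {10001: 92000.0}}
toy = pd.DataFrame({"Year": [2017, 2018, 2018],
                    "Incident Zip": [10001, 10001, 99999]})

# Look up each (year, zip); default to NaN when either is unknown
toy["Income"] = [income_dict.get(y, {}).get(z, np.nan)
                 for y, z in zip(toy["Year"], toy["Incident Zip"])]
```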
Now, we can create a dataframe which contains median incomes and their frequencies in the rat sightings data.
zip_df = df[df['Year'] >= 2017]
zip_df = zip_df[zip_df['Year'] <= 2019]
#Removing rows without an income value
zip_df = zip_df.dropna(subset=["Income"])
#DF with income and sighting frequency
freq_df = zip_df['Income'].value_counts().rename_axis('Income').reset_index(name='counts')
freq_df
| | Income | counts |
|---|---|---|
| 0 | 91846.00000 | 637 |
| 1 | 82405.52493 | 625 |
| 2 | 87486.87128 | 533 |
| 3 | 58430.50554 | 518 |
| 4 | 49195.30471 | 500 |
| ... | ... | ... |
| 525 | 126631.46840 | 1 |
| 526 | 129344.00000 | 1 |
| 527 | 105456.65220 | 1 |
| 528 | 104076.69810 | 1 |
| 529 | 143353.67870 | 1 |
530 rows × 2 columns
Let's create and evaluate a regression model.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#Creating X and Y using Income and Frequency of Rat Sightings
x, y = freq_df['Income'].values.reshape(-1, 1), freq_df['counts'].values.reshape(-1, 1)
#Fitting a regression model
model = LinearRegression().fit(x,y)
Y_pred = model.predict(x)
plt.scatter(x,y)
plt.plot(x, Y_pred, color='red')
plt.title("Median Income vs. Frequency of Rats Per Zipcode")
plt.xlabel("Median Income")
plt.ylabel("Number of Rats")
plt.show()
#Stats for regression model
import statsmodels.api as sm
X = freq_df[['Income']].values
X = sm.add_constant(X)
y = freq_df['counts'].values
model = sm.OLS(y, X)
results = model.fit()
results.summary()
| Dep. Variable: | y | R-squared: | 0.085 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.084 |
| Method: | Least Squares | F-statistic: | 49.21 |
| Date: | Mon, 12 Dec 2022 | Prob (F-statistic): | 7.08e-12 |
| Time: | 14:19:17 | Log-Likelihood: | -3179.1 |
| No. Observations: | 530 | AIC: | 6362. |
| Df Residuals: | 528 | BIC: | 6371. |
| Df Model: | 1 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 160.1522 | 9.605 | 16.675 | 0.000 | 141.284 | 179.020 |
| x1 | -0.0008 | 0.000 | -7.015 | 0.000 | -0.001 | -0.001 |
| Omnibus: | 249.628 | Durbin-Watson: | 0.168 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1198.268 |
| Skew: | 2.112 | Prob(JB): | 6.30e-261 |
| Kurtosis: | 9.035 | Cond. No. | 1.91e+05 |
#Creating training and test sets to evaluate the model on unseen data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
clf = LinearRegression()
clf.fit(x_train, y_train)
print("R^2 of trained model on the test set:", clf.score(x_test, y_test))
R^2 of trained model on the test set: 0.10215612026885934
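Note that `train_test_split` shuffles randomly, so the score changes on every run; passing `random_state` pins the split and makes the reported R² reproducible. A self-contained sketch on synthetic income-like data (all numbers invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (median income, sighting count) with a weak linear trend
rng = np.random.default_rng(0)
x = rng.uniform(30000, 150000, size=200).reshape(-1, 1)
y = 160 - 0.0008 * x.ravel() + rng.normal(0, 30, size=200)

# random_state makes the split, and therefore the score, reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)
score = LinearRegression().fit(x_train, y_train).score(x_test, y_test)
```

Also note that for a regression, `.score` returns R² (variance explained), not classification accuracy.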
Median income explains only a small portion of the variance in sightings. Next, let's test the tourism angle with restaurants: we can load the DOHMH restaurant inspection results and compare restaurant counts to rat sightings per zipcode.
rest_path = "DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
rest_df = pd.read_csv(rest_path, low_memory=False)
rest_df = rest_df[["DBA", "BORO", "ZIPCODE", "GRADE", "RECORD DATE"]]
format = '%m/%d/%Y'
# Extracting the year from 'RECORD DATE' & making it its own column
rest_df['Year'] = rest_df['RECORD DATE'].apply(lambda x: datetime.strptime(x, format).year)
rest_df = rest_df[rest_df['Year'] >= 2010]
rest_df = rest_df.dropna(subset=["ZIPCODE"])
rest_df['ZIPCODE'] = rest_df['ZIPCODE'].astype(int)
rest_df
| | DBA | BORO | ZIPCODE | GRADE | RECORD DATE | Year |
|---|---|---|---|---|---|---|
| 0 | NaN | Queens | 11104 | NaN | 12/10/2022 | 2022 |
| 1 | SAFUR | Queens | 11101 | NaN | 12/10/2022 | 2022 |
| 2 | GLAZE BRYANT PARK | Manhattan | 10018 | NaN | 12/10/2022 | 2022 |
| 3 | HOEK PIZZERIA | Brooklyn | 11201 | NaN | 12/10/2022 | 2022 |
| 4 | PANINI LA CAFE | Brooklyn | 11211 | Z | 12/10/2022 | 2022 |
| ... | ... | ... | ... | ... | ... | ... |
| 230706 | CHARLEY ST | Manhattan | 10012 | NaN | 12/10/2022 | 2022 |
| 230707 | SEAMORE'S | Brooklyn | 11201 | A | 12/10/2022 | 2022 |
| 230708 | SUMA PIZZA | Bronx | 10451 | A | 12/10/2022 | 2022 |
| 230709 | IZAKAYA FUKU | Queens | 11372 | NaN | 12/10/2022 | 2022 |
| 230710 | WOK WOK RESTAURANT | Manhattan | 10013 | NaN | 12/10/2022 | 2022 |
227499 rows × 6 columns
rest_freq_df = rest_df['ZIPCODE'].value_counts().rename_axis('Zip').reset_index(name='Rest_Counts')
rest_freq_df['Rat_Count'] = rest_freq_df['Zip'].apply(lambda z: len(df[df['Incident Zip'] == z]))
rest_freq_df = rest_freq_df[rest_freq_df['Rat_Count'] != 0]
rest_freq_df
| | Zip | Rest_Counts | Rat_Count |
|---|---|---|---|
| 0 | 10003 | 5589 | 1312 |
| 1 | 10013 | 5140 | 1202 |
| 2 | 10019 | 4967 | 848 |
| 3 | 10036 | 4466 | 907 |
| 4 | 10002 | 4433 | 2242 |
| ... | ... | ... | ... |
| 200 | 12345 | 11 | 3 |
| 204 | 10168 | 9 | 1 |
| 207 | 11241 | 6 | 1 |
| 214 | 10048 | 4 | 1 |
| 216 | 10151 | 3 | 1 |
196 rows × 3 columns
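The per-zip join above (a `value_counts` on restaurants plus a row-by-row `len` filter on sightings) can equivalently be built by joining two `value_counts` results on their ZIP index, sketched here on invented ZIP codes:

```python
import pandas as pd

# Invented restaurant and sighting ZIP codes
rest_zips = pd.Series([10001, 10001, 10002])
rat_zips = pd.Series([10001, 10003])

# Count each, then join on the ZIP index; ZIPs with no sightings get 0
merged = (rest_zips.value_counts().rename("Rest_Counts").to_frame()
          .join(rat_zips.value_counts().rename("Rat_Count"))
          .fillna({"Rat_Count": 0}))
```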
x, y = rest_freq_df['Rest_Counts'].values.reshape(-1, 1), rest_freq_df['Rat_Count'].values.reshape(-1, 1)
model = LinearRegression().fit(x,y)
r_sq = model.score(x, y)
Y_pred = model.predict(x)
plt.scatter(x,y)
plt.plot(x, Y_pred, color='red')
plt.title("Number of Restaurants vs. Frequency of Rats Per Zipcode")
plt.xlabel("Number of Restaurants")
plt.ylabel("Number of Rats")
plt.show()
import statsmodels.api as sm
X = rest_freq_df[['Rest_Counts']].values
X = sm.add_constant(X)
y = rest_freq_df['Rat_Count'].values
model = sm.OLS(y, X)
results = model.fit()
results.summary()
| Dep. Variable: | y | R-squared: | 0.122 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.118 |
| Method: | Least Squares | F-statistic: | 27.08 |
| Date: | Mon, 12 Dec 2022 | Prob (F-statistic): | 4.95e-07 |
| Time: | 14:32:36 | Log-Likelihood: | -1641.1 |
| No. Observations: | 196 | AIC: | 3286. |
| Df Residuals: | 194 | BIC: | 3293. |
| Df Model: | 1 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 645.0690 | 108.888 | 5.924 | 0.000 | 430.314 | 859.825 |
| x1 | 0.3537 | 0.068 | 5.204 | 0.000 | 0.220 | 0.488 |
| Omnibus: | 83.268 | Durbin-Watson: | 1.392 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 244.794 |
| Skew: | 1.831 | Prob(JB): | 6.98e-54 |
| Kurtosis: | 7.069 | Cond. No. | 2.32e+03 |
#Creating training and test sets to evaluate the model on unseen data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
clf = LinearRegression()
clf.fit(x_train, y_train)
print("R^2 of trained model on the test set:", clf.score(x_test, y_test))
R^2 of trained model on the test set: 0.1536645021455204
Finally, let's look at the other feature we suspected: resident population size. We found population estimates per zipcode, which we can load and restrict to NYC.
pop_path = "zipcode_populations.csv"
zipcode_pop_df = pd.read_csv(pop_path, low_memory=False)
# Sorting the dataframe by zipcode & removing non-NYC data
zipcode_pop_df.sort_values(by=['zip'], inplace=True)
zipcode_pop_df = zipcode_pop_df[zipcode_pop_df['city'] == "New York City"]
zipcode_pop_df.head()
| | zip | population | city | county |
|---|---|---|---|---|
| 260 | 10001 | 25026 | New York City | New York |
| 35 | 10002 | 74363 | New York City | New York |
| 73 | 10003 | 54671 | New York City | New York |
| 855 | 10004 | 3310 | New York City | New York |
| 552 | 10005 | 8664 | New York City | New York |
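With population per zipcode loaded, a natural next step is a per-capita sighting rate: merge population onto per-zip sighting counts and divide. In this sketch the populations come from the table above, while the sighting counts are invented placeholders:

```python
import pandas as pd

# Populations from the table above; rat counts are invented for illustration
pop = pd.DataFrame({"zip": [10001, 10002], "population": [25026, 74363]})
sightings = pd.DataFrame({"zip": [10001, 10002], "rats": [500, 2230]})

# Merge on zip, then normalize counts to sightings per 1,000 residents
merged = pop.merge(sightings, on="zip")
merged["per_1000"] = merged["rats"] / merged["population"] * 1000
```

Normalizing by residents would let us separate the resident-population effect from the tourism effect hypothesized earlier.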